{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🪄 Optimize models for the ONNX Runtime\n",
    "### Task: Text Generation 📝\n",
    "\n",
    "In this notebook, you'll:\n",
    "\n",
    "1. Optimize Small Language Model(s) (SLMs) from a curated list. Model *architectures* include: Qwen, Llama, Phi, Gemma and Mistral.\n",
    "1. Inference the optimized SLM on the ONNX Runtime as part of a simple console chat application.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🐍 Python dependencies\n",
    "\n",
    "Install the following Python dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "%pip install olive-ai[auto-opt]\n",
    "%pip install transformers==4.44.2 onnxruntime-genai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ☑️ Select an SLM\n",
    "\n",
    "Select a model from the list below by uncommenting your model of choice and ensuring all other models are commented out.\n",
    "\n",
    "The list below is **not** an exhaustive list of text generation models supported by Olive and the ONNX Runtime. Instead, we have curated this list of models that are:\n",
    "\n",
    "- *\"small\"* (less than ~7B parameters) and \n",
    "- *\"popular\"* (i.e. either high trending/download/liked models on Hugging Face)\n",
    "\n",
    "You can optimize and inference other Generative AI models from the following model *architectures* using this notebook:\n",
    "\n",
    "1. Qwen\n",
    "1. Llama (includes Smol)\n",
    "1. Phi\n",
    "1. Gemma\n",
    "1. Mistral\n",
    "\n",
    "Other model architectures are also supported by Olive and the ONNX Runtime - for example [opt-125m](https://github.com/microsoft/Olive/tree/main/examples/opt_125m) and [falcon](https://github.com/microsoft/Olive/tree/main/examples/falcon) - However, they are are not yet supported in the ONNX Runtime Generate API and therefore require you to inference using lower-level APIs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ======================= QWEN MODELS ===========================\n",
    "# MODEL=\"Qwen/Qwen2.5-0.5B-Instruct\"\n",
    "MODEL=\"Qwen/Qwen2.5-1.5B-Instruct\"\n",
    "# MODEL=\"Qwen/Qwen2.5-3B-Instruct\"\n",
    "# MODEL=\"Qwen/Qwen2.5-7B-Instruct\"\n",
    "# MODEL=\"Qwen/Qwen2.5-Math-1.5B-Instruct\"\n",
    "# MODEL=\"Qwen/Qwen2.5-Coder-7B-Instruct\"\n",
    "#================================================================\n",
    "\n",
    "# ======================= LLAMA MODELS ==========================\n",
    "# MODEL=\"meta-llama/Llama-3.2-1B-Instruct\"\n",
    "# MODEL=\"meta-llama/Llama-3.2-3B-Instruct\"\n",
    "# MODEL=\"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
    "# MODEL=\"meta-llama/CodeLlama-7b-hf\"\n",
    "# MODEL=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\n",
    "#================================================================\n",
    "\n",
    "# ======================= PHI MODELS ============================\n",
    "# MODEL=\"microsoft/Phi-3.5-mini-instruct\"\n",
    "# MODEL=\"microsoft/Phi-3-mini-128k-instruct\"\n",
    "# MODEL=\"microsoft/Phi-3-mini-4k-instruct\"\n",
    "#================================================================\n",
    "\n",
    "# ======================= SMOLLM2 MODELS ========================\n",
    "# MODEL=\"HuggingFaceTB/SmolLM2-135M-Instruct\"\n",
    "# MODEL=\"HuggingFaceTB/SmolLM2-360M-Instruct\"\n",
    "# MODEL=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\"\n",
    "#================================================================\n",
    "\n",
    "# ======================= GEMMA MODELS ==========================\n",
    "# MODEL=\"google/gemma-2-2b-it\"\n",
    "# MODEL=\"google/gemma-2-9b-it\"\n",
    "#================================================================\n",
    "\n",
    "# ======================= MISTRAL MODELS ========================\n",
    "# MODEL=\"mistralai/Ministral-8B-Instruct-2410\"\n",
    "# MODEL=\"mistralai/Mistral-7B-Instruct-v0.3\"\n",
    "#================================================================\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🤗 Login to Hugging Face\n",
    "To access models, you'll need to log-in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens). The following command will run you through the steps to login:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!huggingface-cli login"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 📇 Model card\n",
    "\n",
    "The code in the following cell gets some information on the selected model (such as license and number of downloads)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import huggingface_hub as hf\n",
    "\n",
    "m=hf.repo_info(MODEL)\n",
    "print(f\"Model Card :https://huggingface.co/{MODEL}\")\n",
    "print(f\"License: {m.card_data['license']}, {m.card_data['license_link']}\")\n",
    "print(f\"Number of downloads: {m.downloads}\")\n",
    "print(f\"Number of likes: {m.likes}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🪄 Run the Auto Optimizer\n",
    "\n",
    "Next, you'll execute Olive's automatic optimizer using the optimize CLI command, which will:\n",
    "\n",
    "1. Acquire the model from the the Hugging Face model repo.\n",
    "1. Quantize the model to `int4` using GPTQ.\n",
    "1. Capture the ONNX Graph and store the weights in an ONNX data file.\n",
    "1. Optimize the ONNX Graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!olive optimize \\\n",
    "    --model_name_or_path {MODEL} \\\n",
    "    --output_path models/{MODEL} \\\n",
    "    --precision int4"
   ]
  },
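   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### 🔎 Check the optimizer output\n",
     "\n",
     "Before moving on to inference, it's worth checking what the optimizer produced. The following cell is a minimal sketch that simply lists the files under the output path passed to `olive optimize` above. The exact directory layout (for example, whether the ONNX model and its `genai_config.json` sit in a `model/` subfolder) can vary by Olive version."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "from pathlib import Path\n",
     "\n",
     "# List everything Olive wrote under the output path used in the previous cell.\n",
     "# Look for the .onnx graph, its external-data file, and genai_config.json,\n",
     "# which the ONNX Runtime Generate API needs at inference time.\n",
     "output_dir = Path(\"models\") / MODEL\n",
     "for path in sorted(output_dir.rglob(\"*\")):\n",
     "    if path.is_file():\n",
     "        size_mb = path.stat().st_size / 1e6\n",
     "        print(f\"{path}  ({size_mb:.1f} MB)\")"
    ]
   },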
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧠 Inference model using ONNX Runtime\n",
    "\n",
    "The ONNX Runtime (ORT) is a fast and light-weight package (available in many programming languages) that runs cross-platform. ORT enables you to infuse your AI models into your applications so that inference is handled *on-device*. The following code creates a simple console-based chat interface that inferences your optimized model.\n",
    "\n",
    "### How to use\n",
    "\n",
    "The sample chat app to run is found as [model-chat.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-chat.py) in the [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai/) Github repository."
   ]
  }
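   ,
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### 🧪 Minimal chat loop (sketch)\n",
     "\n",
     "If you'd rather stay inside this notebook, the cell below is a minimal sketch of such a chat loop built on the ONNX Runtime Generate API. It is **not** the official `model-chat.py` sample, and it makes a few assumptions: `onnxruntime-genai` >= 0.5 (older releases set `params.input_ids` instead of calling `generator.append_tokens`), the optimized model and its `genai_config.json` live under `models/{MODEL}/model` (adjust `model_dir` to wherever the previous cells placed them), and a Qwen-style chat template (other model families use different templates)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Minimal console chat loop on the ONNX Runtime Generate API (a sketch,\n",
     "# not the official model-chat.py sample). Assumptions:\n",
     "#   - onnxruntime-genai >= 0.5 (older releases set params.input_ids instead\n",
     "#     of calling generator.append_tokens)\n",
     "#   - the optimized model (with genai_config.json) is at models/{MODEL}/model\n",
     "#   - a Qwen-style chat template; other model families differ\n",
     "import onnxruntime_genai as og\n",
     "\n",
     "model_dir = f\"models/{MODEL}/model\"  # adjust if your output layout differs\n",
     "\n",
     "model = og.Model(model_dir)\n",
     "tokenizer = og.Tokenizer(model)\n",
     "tokenizer_stream = tokenizer.create_stream()\n",
     "\n",
     "chat_template = \"<|im_start|>user\\n{input}<|im_end|>\\n<|im_start|>assistant\\n\"\n",
     "\n",
     "text = input(\"Prompt (or 'quit' to exit): \")\n",
     "while text.lower() != \"quit\":\n",
     "    input_tokens = tokenizer.encode(chat_template.format(input=text))\n",
     "\n",
     "    params = og.GeneratorParams(model)\n",
     "    params.set_search_options(max_length=2048)\n",
     "    generator = og.Generator(model, params)\n",
     "    generator.append_tokens(input_tokens)\n",
     "\n",
     "    # Stream tokens to the console as they are generated\n",
     "    while not generator.is_done():\n",
     "        generator.generate_next_token()\n",
     "        new_token = generator.get_next_tokens()[0]\n",
     "        print(tokenizer_stream.decode(new_token), end=\"\", flush=True)\n",
     "    print()\n",
     "\n",
     "    del generator  # release the KV cache between turns\n",
     "    text = input(\"Prompt (or 'quit' to exit): \")"
    ]
   }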
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "olive-pinned",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
